OSSM-4815: Document HA for a mesh #96010

rh-tokeefe · 2025-07-11T15:30:31Z

Affects:

service-mesh-docs-main
service-mesh-docs-3.0
service-mesh-docs-3.1

PR must be merged to service docs main and CP'd back to the 3.0 and 3.1 branches.

Version(s): 3.1

Issue: https://issues.redhat.com/browse/OSSM-4815

Link to docs preview:
https://96010--ocpdocs-pr.netlify.app/openshift-service-mesh/latest/install/ossm-installing-openshift-service-mesh.html#ossm-about-istio-high-availability_ossm-customizing-istio-configuration

QE review:

QE has approved this change.

Additional information:

openshift-ci-robot · 2025-07-11T15:30:36Z

@rh-tokeefe: This pull request references OSSM-4815 which is a valid jira issue.

In response to this:

Version(s): 3.1

Issue: https://issues.redhat.com/browse/OSSM-4815

Link to docs preview:

QE review:

QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

ocpdocs-previewbot · 2025-07-11T15:37:25Z

🤖 Tue Jul 15 21:23:08 - Prow CI generated the docs preview:

https://96010--ocpdocs-pr.netlify.app/
https://96010--ocpdocs-pr.netlify.app/openshift-service-mesh/latest/install/ossm-installing-openshift-service-mesh.html

fjglira

I left some minor suggestions.

fjglira · 2025-07-14T11:40:34Z

modules/ossm-about-istio-high-availability.adoc

+
+Running the {istio} control plane in High Availability (HA) mode prevents single points of failure, and ensures continuous mesh operation even if an `istiod` pod fails. By using HA, if one `istiod` pod becomes unavailable, another one continues to manage and configure the {istio} control plane, preventing service outages or disruptions. HA provides scalability by distributing the control plane workload, enables graceful upgrades, supports disaster recovery operations, and protects against zone-wide mesh outages.
+
+There are two ways for a system administrator to configure HA: by defining replica count or by using autoscaling.


Suggested change

There are two ways for a system administrator to configure HA: by defining replica count or by using autoscaling.

There are two ways for a system administrator to configure HA for the `istiod` deployment:

* Defining a static replica count: This involves setting a fixed number of `istiod` pods, providing a consistent level of redundancy.

* Using autoscaling: This dynamically adjusts the number of `istiod` pods based on observed resource utilization or custom metrics, offering more efficient resource consumption for fluctuating workloads.

I think adding a short overview here is better, because it gives users a first look at what the configuration types are.

fjglira · 2025-07-14T11:48:43Z

modules/ossm-configuring-istio-ha-autoscaling.adoc

+[id="ossm-configuring-istio-ha-autoscaling_{context}"]
+= Configuring Istio HA by using autoscaling 
+
+Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation in response to the varying number of workloads in the mesh.


Suggested change

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation in response to the varying number of workloads in the mesh.

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation based on observed resource utilization (such as CPU or memory) or custom metrics, effectively responding to the varying number of workloads and overall traffic patterns within the mesh.

fjglira · 2025-07-14T12:15:55Z

modules/ossm-configuring-istio-ha-autoscaling.adoc

+<1> Defines the minimum number of {istio} control plane replicas that always run. 
+<2> Defines the maximum number of {istio} control plane replicas, allowing for scaling based on load. To support HA, there must be at least two replicas.


It would be highly beneficial to add a note here describing the specific metrics that can be used to configure `istiod` autoscaling (scaling up and down). For example, users can set `spec.values.pilot.cpu.targetAverageUtilization` and `spec.values.pilot.memory.targetAverageUtilization` to define the CPU and memory thresholds that trigger scaling actions. Sorry for not adding this to the upstream docs as well; I'll add it there. I think it's good to point users to the configuration that triggers scaling.
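
For illustration, a minimal sketch of how these fields could sit next to the autoscaling minimum and maximum. It assumes the `Istio` resource uses the `sailoperator.io/v1` API, is named `default`, and targets the `istio-system` namespace; none of these come from the PR, and the threshold values are placeholders rather than recommendations:

```yaml
apiVersion: sailoperator.io/v1   # assumed API version; verify against the installed Operator
kind: Istio
metadata:
  name: default                  # assumed resource name
spec:
  namespace: istio-system        # assumed control plane namespace
  values:
    pilot:
      autoscaleEnabled: true     # let the autoscaler manage the istiod replica count
      autoscaleMin: 2            # at least two replicas are needed for HA
      autoscaleMax: 5            # illustrative upper bound
      cpu:
        targetAverageUtilization: 80   # illustrative CPU utilization target, in percent
      memory:
        targetAverageUtilization: 80   # illustrative memory utilization target, in percent
```

With settings like these, the number of `istiod` pods would stay between the minimum and maximum, and the utilization targets would decide when to add or remove pods.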

fjglira · 2025-07-14T12:17:19Z

modules/ossm-configuring-istio-ha-autoscaling.adoc

+istiod-7c7b6564c9-xkmsl   1/1     Running   0          85s
+----
+
+Two `istiod` pods are running, which indicates HA was successfully configured.


Suggested change

Two `istiod` pods are running, which indicates HA was successfully configured.

Two `istiod` pods are running, which is the minimum requirement for a highly available Istio control plane and indicates a basic HA setup is in place.

fjglira · 2025-07-14T12:53:46Z

modules/ossm-configuring-istio-ha-replicacount.adoc

+[id="ossm-configuring-istio-ha-replicacount_{context}"]
+= Configuring Istio HA by using replica count
+
+Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments in which the number of workloads does not scale.


Suggested change

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments in which the number of workloads does not scale.

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments where the control plane workload is relatively stable or predictable, or when manual scaling of the `istiod` is preferred.
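
A fixed replica count could be expressed along these lines. This is a sketch only, assuming the `sailoperator.io/v1` API, a resource named `default`, and the `istio-system` namespace, none of which are stated in this PR:

```yaml
apiVersion: sailoperator.io/v1   # assumed API version; verify against the installed Operator
kind: Istio
metadata:
  name: default                  # assumed resource name
spec:
  namespace: istio-system        # assumed control plane namespace
  values:
    pilot:
      autoscaleEnabled: false    # turn autoscaling off so the replica count stays fixed
      replicaCount: 2            # two istiod pods, the minimum for HA
```

Disabling autoscaling here is deliberate: if it stays enabled, the autoscaler rather than `replicaCount` determines how many `istiod` pods run.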

openshift-ci · 2025-07-15T21:24:14Z

@rh-tokeefe: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 11, 2025

openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 11, 2025

rh-tokeefe force-pushed the OSSM-4815 branch from 60ee8c8 to 69f8e50 Compare July 11, 2025 20:25

fjglira reviewed Jul 14, 2025

OSSM-4815: Document HA for a mesh

d1ed0d4

rh-tokeefe force-pushed the OSSM-4815 branch from a03245f to d1ed0d4 Compare July 15, 2025 21:11

